Welcome to the Wonderland of Data Science!
Our analysis uses data from the 20,036 responses to Kaggle’s 2020 survey. Although the survey is limited to the Kaggle community, we believe the insights about these Kagglers will not only apply to our community but also generalize to the larger data science community, helping us better understand the roadmap that people are taking, the tools that people are developing, and the language that people are using to communicate about data science nowadays. We chose to tell our story through radar charts because we want to focus on the behavioral tendencies of the respondents and compare trends between the different groups of data scientists detected within the Kaggle community.
Just like Alice going on an adventure to the Wonderland, no matter where you are on the data science path, you will definitely “grow up” and survive in this confusing yet very exciting world. We hope our analysis will serve to guide you through the world of data science and help you stay up to date with the latest trends in the field. Now, let’s get started!
In this analysis, we aim to detect different groups of data scientists within the Kaggle community using a clustering technique. Questions 1, 2, 3, 4, 6, and 15 are selected as input features for clustering since demographics, education, and years of experience are all fixed properties representative of an individual. The input questions are as below:
Given the clusters, individuals who did not take the survey will be able to identify the group most relevant to them based on their background information and gain useful insights regarding data science career tracks, technical skills, and learning resources through our analysis.
Since we have 6 input questions corresponding to 6 dimensions, summarizing and visualizing the data can be difficult, and clustering might be ineffective in high-dimensional space. To overcome these limitations, we propose first using correspondence analysis (CA), an extension of principal component analysis suited to exploring relationships among qualitative variables. This technique is generally used to analyze survey data and identify associations between variable categories. Since our data contains more than two categorical variables, we choose multiple correspondence analysis (MCA), which handles any number of categorical variables. However, MCA has the disadvantage that its leading dimensions tend to explain only a small share of the variance. For the purpose of this analysis, we select the first two dimensions, which together account for 10.5% of the total variance in the survey’s answers.
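To make the MCA step concrete, here is a minimal sketch using only NumPy. It is not the code behind our results: it is the textbook indicator-matrix formulation (one-hot encode the answers, then take an SVD of the standardized residuals), and the function name `mca` and the toy answers are hypothetical. Raw inertia shares computed this way are typically small, which is in line with the low percentage reported above.

```python
import numpy as np

def mca(rows, n_components=2):
    """Indicator-matrix MCA sketch.

    rows: list of equal-length tuples of category labels (one tuple per respondent).
    Returns (row coordinates, share of inertia explained by each kept dimension).
    """
    n_vars = len(rows[0])
    # One indicator column per (question index, answer) pair.
    columns = sorted({(j, row[j]) for row in rows for j in range(n_vars)})
    index = {col: i for i, col in enumerate(columns)}
    Z = np.zeros((len(rows), len(columns)))
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            Z[i, index[(j, value)]] = 1.0

    P = Z / Z.sum()            # correspondence matrix
    r = P.sum(axis=1)          # row masses
    c = P.sum(axis=0)          # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, _ = np.linalg.svd(S, full_matrices=False)

    inertia = sing ** 2        # principal inertias
    explained = inertia / inertia.sum()
    # Principal row coordinates: D_r^{-1/2} U Sigma.
    coords = (U[:, :n_components] * sing[:n_components]) / np.sqrt(r)[:, None]
    return coords, explained[:n_components]

# Hypothetical answers: (age band, degree, years of coding).
toy = [
    ("22-24", "Bachelor", "1-2 years"),
    ("22-24", "Bachelor", "< 1 year"),
    ("30-34", "Master", "5-10 years"),
    ("30-34", "Master", "5-10 years"),
    ("25-29", "Bachelor", "never"),
    ("25-29", "Master", "1-2 years"),
]
coords, explained = mca(toy)
print(coords.shape, explained)
```

The two columns of `coords` play the role of the two dimensions that are passed on to clustering.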
Our goal is to better understand the specific communities represented in the survey. Thus, we propose using K-means clustering, one of the simplest and most popular clustering algorithms, which aims to group similar data points together and discover underlying patterns. To detect clusters, K-means identifies k centroids and allocates every data point to the nearest one, while keeping the within-cluster sum of squared distances to the centroids as small as possible. After examining different possible values of k, we decide to pick k = 3. This helps the model avoid overfitting while preserving its descriptive power. Contextually, we believe 3 is a suitable number of groups of similar respondents that will allow us to dive deep into the survey dataset and tell the diverse stories of data scientists from around the world.
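The sketch below illustrates the idea with a plain NumPy version of Lloyd's algorithm, the standard K-means procedure. It is an illustration under assumptions, not our actual pipeline: the function name `kmeans`, the synthetic two-dimensional points, and the choice to compare the within-cluster sum of squares across values of k are all hypothetical stand-ins for "examining different possible values of k".

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Synthetic stand-in for the two MCA dimensions: three loose blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Compare the within-cluster sum of squares for several values of k.
for k in range(1, 6):
    _, _, inertia = kmeans(X, k)
    print(k, round(float(inertia), 2))
```

Because the result depends on the random starting centroids, it is common to run the algorithm from several seeds and keep the run with the lowest within-cluster sum of squares.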
Now that we have found the clusters of similar respondents from the survey, we will next examine the types of data scientists within the Kaggle community that these clusters represent. Since our variables are all categorical, we choose to calculate the mode to measure the central tendency of each variable. Below is the table of cluster distribution with the corresponding mode values of the input questions:
| Group | Count | Age | Continent | Degree | Years of Coding | Years of ML |
|---|---|---|---|---|---|---|
| Experts | 5966 | 30-34 | Europe | Master’s degree | 5-10 years | 2-3 years |
| Junior Players | 12039 | 22-24 | Asia | Bachelor’s degree | 1-2 years | Under 1 year |
| Amateur Outsiders | 2031 | 25-29 | Asia | Bachelor’s degree | I have never written code | NA |
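A table like the one above can be produced by counting the members of each cluster and taking the most frequent answer per question. The following is a minimal standard-library sketch; the helper name `cluster_profile`, the column names, and the toy rows are hypothetical and are not taken from the survey data.

```python
from collections import Counter, defaultdict

def cluster_profile(rows, labels, columns):
    """Return {cluster label: {"count": n, column: modal answer, ...}}."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    profile = {}
    for label, members in groups.items():
        # Mode of each column: the single most common answer in the cluster.
        modes = {
            col: Counter(m[i] for m in members).most_common(1)[0][0]
            for i, col in enumerate(columns)
        }
        profile[label] = {"count": len(members), **modes}
    return profile

# Hypothetical respondents: (age band, degree) plus a cluster label each.
rows = [
    ("22-24", "Bachelor's degree"),
    ("22-24", "Master's degree"),
    ("18-21", "Bachelor's degree"),
    ("30-34", "Master's degree"),
    ("30-34", "Doctoral degree"),
    ("35-39", "Master's degree"),
]
labels = ["junior", "junior", "junior", "expert", "expert", "expert"]

profile = cluster_profile(rows, labels, columns=["Age", "Degree"])
for name, summary in profile.items():
    print(name, summary)
```

When two answers are tied for most frequent, `Counter.most_common` returns the one it encountered first, so ties deserve a manual look before a mode is reported.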
Overall, there are three representatives in the Kaggle community:
Alice and the Rabbit Hole
The first group consists of individuals aged 25-29, mostly with a Bachelor’s degree but with neither coding nor machine learning experience. They seem to be amateur outsiders who have some interest in data science and would like to know more or enter the field. This group is the smallest of the three, accounting for roughly 10% of the sample.
Alice at the Mad Tea Party
The second group consists of individuals aged 22-24, mostly with a Bachelor’s degree and some years of coding but not much machine learning experience. They seem to be fresh graduates or beginners in the data science world who already know what is out there and are on their way to discovering more about the path. This group is the largest of the three, accounting for roughly 60% of the sample.
Alice Waking Up at the Riverbank
The third group consists of individuals aged 30-34, mostly with a Master’s degree and many years of both coding and machine learning experience. We can call them the experts of the game, as they seem to be much more proficient in data science than anyone else. This group is the second largest of the three, accounting for roughly 30% of the sample.
This observation closely matches our expectation about the current roles of the three groups: the experts are mostly scientists and researchers; the junior players are mainly students or working in various positions within the data science field; and the amateur outsiders are either unemployed, not working in the field (others), or even students, as they could be pursuing a Master’s or a Bachelor’s degree in another field.
In terms of company size, the three groups all have a similar shape, in which ‘0-49 employees’ is the most popular size. However, the experts are likely to work for larger companies, as they tend to report more individuals responsible for data science workloads at their business, as opposed to the juniors and outsiders, who might still be in school or working for smaller companies.
The experts mostly work at places where machine learning methods are well established, the juniors work at places where machine learning methods are being explored, and the outsiders tend to work at places that do not use machine learning. Intuitively, this makes sense, as the experts have the most years of machine learning experience, while the juniors and outsiders have little.
As indicated in Q5, the experts are likely to be data scientists and researchers; therefore, their work also focuses on building and running prototypes to explore applying machine learning to new areas, in addition to supporting business decisions with their data analysis. The junior players and amateur outsiders are expected to be entry-level data analysts or business analysts whose responsibilities simply focus on analyzing and understanding data to support business and products.
How much can you make with a data science job? This question is always of interest, especially to those who have just entered the field or are thinking about switching to data science from another field. According to the results in Q24, the beginners are likely to be paid up to $50K, and no more than $150K. Fortunately, the salary for the experts spans a wider range, from $0 – $50K through $100K – $150K to even more than $200K, implying that compensation is likely to increase as you improve your skills and move up the ladder over time.
Python stands out as the most popular programming language used on a regular basis across all groups. Since Python has hundreds of libraries and frameworks focused on data analytics and machine learning, it is no surprise that data scientists love to use it. Besides Python, SQL is also popular among data science insiders (the experts and the juniors), since it is widely used in data analytics as part of the business intelligence process.
It is very interesting that the outsiders also use R in addition to Python. This makes sense, as R is especially powerful for statistical analysis and modelling, and is usually helpful for people who want to strengthen their foundation in data science. Python, on the other hand, is better suited to machine learning, deep learning, and NLP, the more advanced topics of data science.
Consistent with Q7, Q8 clearly indicates that Python is the language everyone should prioritize learning first in order to communicate effectively in the data science world. Ideally, an individual can also master other languages such as SQL and R, but in the long run, Python is the way to go.
Knowing the results of Q7 and Q8, it is no surprise that Jupyter is the most common IDE, while Visual Studio Code and PyCharm come second, since all three environments support Python very well. Since R is largely used by the outsiders, RStudio and Jupyter are equally popular with this group.
Interestingly, we don’t have any responses from the outsiders for questions related to specific tools/packages that are used in data science. The responses from the other two groups have very similar shapes, though the experts’ seem to be more diverse than the juniors’. A recap of the responses is as below:
One thing we can notice from the radar charts in this section is that the junior team seems to have a clearer idea of the programming languages and IDEs they want to focus on than the other two groups. The experts may want to challenge themselves and explore different available tools so that they can implement their products more efficiently. The outsiders might have been trying a couple of different things to learn more about data science and collecting clues to figure out the best way to land a job in the field. Depending on what you want to do and the level you are at, you can identify the tools and packages most suitable for you and prioritize mastering them accordingly.
The top three sources for learning data science are Coursera, Udemy, and Kaggle Learn Courses for all groups. It is interesting to see that for the insider groups, university is also a notable platform.
The three groups’ radar charts show how tool choice shifts as an individual moves further into the data science field. In the early stage, people use basic statistical software such as Excel and Google Sheets, and later on gradually switch to local development environments like RStudio or Jupyter.
The top three media sources for data science topics are Kaggle, YouTube, and blogs. The trend indicates that YouTube works best for the amateur outsiders (presumably for its visual and interactive format), while journal publications and blogs are more popular with the experts.
The key takeaways from our findings are:
So the journey to discovering the world of data science has come to an end. Hopefully, after reading this analysis, you have gained some meaningful insights into the field and feel motivated to learn data science. Thank you for reading, and thanks to Kaggle for giving us the chance to explore this fun dataset. We hope you all enjoyed it and will find your own way to success in this fuzzy yet fantastic world. All the best!